Isolation Forest

Anomaly Detection of Time-Series Data


READ - I've created a Python script that lets the reader click a button and either see all of the underlying code OR just look at the raw output (charts, plots, and so on).
As you know, these notebooks can contain a fair amount of code... and sometimes folks just want the results. Here is an example.

The notebook defaults to NOT showing any code, so click the toggle button to show the underlying code...
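For reference, a typical way to implement such a toggle in a notebook is to inject a small piece of HTML/JavaScript via IPython. This is a reconstruction of the common pattern, not necessarily the exact script used here; it relies on jQuery, which the classic notebook and nbconvert HTML exports load by default.

```python
from IPython.display import HTML

# A sketch of the classic "hide/show code" toggle. `div.input` is the CSS
# class that classic-notebook HTML exports use for code cells.
toggle = HTML("""
<script>
var code_shown = false;
function code_toggle() {
  if (code_shown) { $('div.input').hide(); } else { $('div.input').show(); }
  code_shown = !code_shown;
}
$(document).ready(function() { $('div.input').hide(); });
</script>
<form action="javascript:code_toggle()">
  <input type="submit" value="Toggle code visibility">
</form>
""")
toggle  # rendering this cell shows the button; code starts hidden
```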


I will keep the separate timestamp column T for my plots

Splitting the data (over a year of readings)
Graphing the data

This code will plot the entire dataset:

We might as well plot a boxplot for context as well:
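The plotting code itself isn't shown above, so here is a minimal sketch of both plots, using synthetic stand-in data (the real temperature series and its column names are assumptions here, apart from the timestamp column T mentioned above):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt

# Hypothetical stand-in: a year of hourly temperatures with a timestamp column T.
rng = np.random.default_rng(0)
t = pd.date_range("2020-01-01", periods=24 * 365, freq="h")
seasonal = 5 * np.sin(np.arange(len(t)) * 2 * np.pi / len(t))
df = pd.DataFrame({"T": t, "temperature": 21 + seasonal + rng.normal(0, 1, len(t))})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(df["T"], df["temperature"], linewidth=0.5)  # entire dataset
ax1.set_title("Full dataset")
ax2.boxplot(df["temperature"])                        # boxplot for context
ax2.set_title("Boxplot for context")
fig.tight_layout()
fig.savefig("overview.png")
```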

Isolation Forest

Background:

Isolation forest is a machine learning algorithm for anomaly detection. It's an unsupervised learning algorithm that identifies anomalies by isolating outliers in the data, and it is based on the decision tree algorithm. It isolates outliers by randomly selecting a feature from the given set of features (in my case, this analysis has a single feature, 'temperature'), and then randomly selecting a split value between the maximum and minimum values of that feature. This random partitioning produces shorter paths in the trees for anomalous data points, thus distinguishing them from the rest of the data. Think of it as an outlier being naturally easy to 'segment'.

Starting point: normally one would construct a profile of what's "normal", and then report anything that deviates from that profile as anomalous. But the isolation forest algorithm does not first define "normal" behavior, and it does NOT calculate point-based distances.

Isolation Forest instead works by isolating anomalies in the dataset. The algorithm is based on the principle that anomalies are observations that are few and different, which should make them easier to identify. It uses an ensemble of isolation trees over the given data points to isolate anomalies. Anomalies need fewer random partitions to be isolated than "normal" points, so the anomalies will be the points with a smaller path length in the tree, path length being the number of edges traversed from the root node.
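The "fewer partitions" idea can be demonstrated with a toy, pure-Python sketch (synthetic numbers, not this notebook's data): repeatedly make a random cut between the min and max, keep the side containing a chosen point, and count how many cuts it takes to leave that point alone.

```python
import random

def splits_to_isolate(values, target, rng, max_depth=100):
    """Count random splits needed to isolate `target` from the other values."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:
            break
        cut = rng.uniform(lo, hi)
        # Keep only the side of the cut that still contains the target point.
        current = [v for v in current if (v < cut) == (target < cut)]
        depth += 1
    return depth

# A tight cluster plus one far-away outlier (made up for illustration).
data = [20.0 + 0.1 * i for i in range(50)] + [80.0]

def avg_depth(point, trials=200):
    return sum(splits_to_isolate(data, point, random.Random(s)) for s in range(trials)) / trials

# The outlier at 80.0 is isolated in far fewer splits than the inlier at 22.5.
print(avg_depth(80.0), avg_depth(22.5))
```

This is exactly the shorter-path-length effect the paragraph above describes, just with one feature and one tree at a time.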

Later I will apply this algorithm to multivariate data

Working Code

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html

I like to actually understand this stuff, so the scikit-learn reference above is worth reading in full.
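Since the working code itself isn't reproduced above, here is a minimal sketch of fitting scikit-learn's IsolationForest to a single temperature feature. The data, column names, and contamination value are assumptions for illustration, not this notebook's actual values:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical univariate temperature series with a few injected anomalies.
rng = np.random.default_rng(42)
temps = rng.normal(21.0, 1.5, size=500)
temps[[50, 200, 350]] = [35.0, 5.0, 40.0]  # obvious outliers
df = pd.DataFrame({"temperature": temps})

iso = IsolationForest(contamination=0.01, random_state=42, n_jobs=-1)
df["anomaly"] = iso.fit_predict(df[["temperature"]])       # -1 = anomaly, 1 = inlier
df["score"] = iso.decision_function(df[["temperature"]])   # more negative = more anomalous

print(df[df["anomaly"] == -1])
```

Setting `n_jobs=-1` uses all cores, which is what produces the `Parallel(n_jobs=24)` log lines shown below when `verbose` is turned on.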

[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 out of  24 | elapsed:    0.6s remaining:    7.4s
[Parallel(n_jobs=24)]: Done  24 out of  24 | elapsed:    0.7s finished

The tree-building process...

Isolation forests are an unsupervised extension of the popular random forest algorithm. The building blocks of isolation forests are isolation trees with a binary outcome (is/is not an outlier).

When an isolation forest is built, the algorithm splits each individual data point off from all other data points. The easier it is to isolate a single point in space from all other points, the more likely it is an outlier (because it’s far away from all other data points). If a data point is an in-lier, it will be closely surrounded by other data points, and will take more splits to isolate (1). See the graphic below as an illustration.

Important

Inliers are clustered at the right with positive anomaly scores. Points with negative scores are anomalies.

Working Output!

I exported the HTML directly to GitHub and am plotting it here (it has full HTML interactivity!)


I have to comment out the above code because it makes the notebook and HTML export very large.

IF you re-download this, just uncomment it. It's nice: you can see an interactive Plotly plot of all of the anomalies, which looks like the image below:

[image: interactive Plotly plot of the detected anomalies]

OUTPUT:

[image: plot of the model's output]

NOTE: IF you had the actual labels, which you don't here, you could evaluate the model, e.g. by computing how many of the true outliers the model found out of how many outliers were actually present in the data. Maybe someday we could identify real sources of anomalies, like weather patterns, and do this...
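If labels ever became available, that evaluation might look like the following sketch. The label sets here are entirely made up for illustration; the real data has no ground truth:

```python
# Hypothetical ground truth: index positions of the true anomalies.
true_anomalies = {50, 200, 350}
flagged = {50, 200, 351}  # indices the model hypothetically flagged

tp = len(true_anomalies & flagged)
precision = tp / len(flagged)      # fraction of flagged points that are real anomalies
recall = tp / len(true_anomalies)  # fraction of real anomalies the model found
print(f"precision={precision:.2f} recall={recall:.2f}")  # → precision=0.67 recall=0.67
```

Recall is the "outliers found divided by outliers present" quantity described above; precision guards against a model that flags everything.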


STOP AND BREATHE







Appendix of Code References: